Detection and localization of caries and hypomineralization on dental photographs with a vision transformer model

Caries and molar-incisor hypomineralization (MIH) are among the most prevalent diseases worldwide and need to be reliably diagnosed. The use of dental photographs and artificial intelligence (AI) methods may contribute to realizing accurate and automated diagnostic visual examinations in the future. Therefore, the present study aimed to develop an AI-based algorithm that can detect, classify and localize caries and MIH. This study included an image set of 18,179 anonymous photographs. Pixelwise image labeling was achieved by trained and calibrated annotators using the Computer Vision Annotation Tool (CVAT). All annotations were made according to standard methods and were independently checked by an experienced dentist. The entire image set was divided into training (N = 16,679), validation (N = 500) and test sets (N = 1000). The AI-based algorithm was trained and finetuned over 250 epochs by using image augmentation and adapting a vision transformer network (SegFormer-B5). Statistics included the determination of the intersection over union (IoU), average precision (AP) and accuracy (ACC). The overall IoU, AP and ACC of the finetuned model were 0.959, 0.977 and 0.978, respectively. The corresponding values for the most relevant caries classes of non-cavitations (0.630, 0.813 and 0.990) and dentin cavities (0.692, 0.830 and 0.997) were found to be high. MIH-related demarcated opacity (0.672, 0.827 and 0.993) and atypical restoration (0.829, 0.902 and 0.999) showed similar results. Here, we report that the model achieves excellent precision for pixelwise detection and localization of caries and MIH. Nevertheless, the model needs to be further improved and externally validated.


INTRODUCTION
Caries is among the most prevalent non-communicable diseases in all age groups worldwide 1,2, and developmental disorders such as molar-incisor hypomineralization (MIH), synonymously named "chalky teeth", are of additional clinical relevance, especially in younger populations 3. Both entities need to be reliably diagnosed by dental professionals. Here, a visual examination (VE) must be recognized as the method of choice for caries and MIH detection due to its simplicity, rapidity, and documented validity [4][5][6][7][8][9]. However, when considering the documented diagnostic variability between different dentists or work groups 5,6, it can be argued that the reliability of VE can be improved and should optimally be as objective as possible. Following this aim, the use of tooth photographs (a digital and machine-readable equivalent to a clinical examination) and artificial intelligence (AI) methods may contribute to accurate diagnostic evaluations in the future. Recently, a few study groups have used and evaluated convolutional neural networks (CNNs) with digital photographs for the detection of caries [10][11][12] and MIH 13,14. All studies proved the concept of using AI-based methods for dental photographs, and promising results were published. While a publicly accessible model would enable an independent evaluation by other research groups, no such model has been introduced thus far. Most recently, vision transformer networks were introduced as an alternative to established CNNs for various image recognition tasks 15. Considering their computational efficiency and accuracy, transformers may outperform current CNN standards in the future. AI-based solutions for detecting pathologies, including caries and MIH, should optimally be based on this new technology, which has rarely been applied in medicine and dentistry until now [16][17][18][19].
Therefore, the present study first aims to develop a transformer-based model to achieve precise and simultaneous pixelwise detection and localization of relevant caries and MIH classes from dental photographs. Second, it is hypothesized that the model could achieve an accuracy of at least 98% and an average precision of 0.5 for the detection and localization of caries and MIH classes. The final study aim is to make the AI-based model publicly accessible as a web application.

Data set
In the complete image set, 34,710 pathological findings belonging to the caries (N = 26,360) and MIH (N = 8350) entities were detected and classified. Non-cavitated caries lesions and dentin cavities were found to be the most frequent caries classifications. Hypomineralized teeth were predominantly characterized by demarcated opacities and enamel disintegrations. Detailed distributions among classifications in relation to the training, validation and test sets can be observed in Table 1.

Model performance
The highest pixel numbers for caries were documented for non-cavitated lesions and dentin cavities (Table 2). In contrast, the pixel counts for grayish translucencies and enamel breakdowns were lower by factors of ~30 and ~50, respectively (Table 2). In the MIH entity, demarcated opacities were labeled most often, followed by atypical restorations and enamel disintegrations (Table 2). The diagnostic performance in terms of F1-score, IoU, AP and ACC for each class can also be observed in Table 2. Notably, even after baseline training, most of the IoU values were above 0.4 (Table 2), except for caries-related grayish translucencies (IoU = 0.210) and enamel breakdowns (IoU = 0.088). The IoU values increased up to ~0.8 after finetuning. However, the values for caries-associated enamel breakdowns (IoU = 0.352) and enamel disintegrations due to MIH (IoU = 0.507) remained lower than all others (Table 2). The model's overall IoU value was 0.959 after finetuning. When considering the AP, the same pattern emerged (Table 2 and Fig. 1). After baseline training, the AP values ranged between 0.420 and 0.751 for the caries classes and between 0.657 and 0.704 for the MIH classes. The model performance once again increased after finetuning for caries (0.588-0.882) and MIH (0.669-0.902); the overall AP reached 0.977. The ACC on the pixel level was found to be constantly high throughout baseline training as well as finetuning and, with one exception, exceeded 0.99. The overall ACC was 0.978 after finetuning (Table 2).
In addition to the pixelwise analysis (Table 2), Table 3 summarizes the model performance for caries and MIH detection on an image level. The overall diagnostic ACC values were found to be high, with numbers above 95%. SE and SP ranged between ~80% and ~100%. Low SE was documented only in the case of caries-related enamel breakdowns (Table 3).
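The relationship between the pixel-level metrics reported here can be illustrated with a minimal sketch. It assumes that each class is scored as a binary mask against the reference annotation; the function name `pixel_metrics` and the toy 10 × 10 label maps are hypothetical and are not taken from the study. AP is omitted because it additionally requires the model's per-pixel confidence scores.

```python
def pixel_metrics(reference, prediction, positive_class):
    """Return IoU, F1 and accuracy for one class, scored as a binary mask.

    `reference` and `prediction` are equally sized 2D lists of class labels.
    """
    tp = fp = fn = tn = 0
    for ref_row, pred_row in zip(reference, prediction):
        for ref, pred in zip(ref_row, pred_row):
            ref_pos = ref == positive_class
            pred_pos = pred == positive_class
            if ref_pos and pred_pos:
                tp += 1
            elif pred_pos:
                fp += 1
            elif ref_pos:
                fn += 1
            else:
                tn += 1
    union = tp + fp + fn
    iou = tp / union if union else float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn) if union else float("nan")
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"iou": iou, "f1": f1, "acc": acc}


# Toy example: a 4-pixel lesion and a prediction shifted by two pixels.
reference = [[0] * 10 for _ in range(10)]
prediction = [[0] * 10 for _ in range(10)]
for col in range(2, 6):
    reference[5][col] = 1   # annotated lesion covers columns 2-5
for col in range(4, 8):
    prediction[5][col] = 1  # predicted lesion covers columns 4-7

print(pixel_metrics(reference, prediction, positive_class=1))
```

In this toy case, the accuracy is 0.96 although the IoU is only one third, because the many correctly classified background pixels dominate the accuracy. This is one plausible reading of why ACC stays above 0.99 for classes whose IoU is considerably lower.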
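How an image-level evaluation can be derived from pixelwise output is sketched below. The rule used here (an image counts as positive for a class if at least `min_pixels` pixels carry that label) is an assumption made for illustration; the source does not state which rule was applied for Table 3. The function names are hypothetical.

```python
def image_level_counts(reference_masks, predicted_masks, class_id, min_pixels=1):
    """Count image-level TP, FP, FN and TN for one class.

    Assumption: an image is positive if >= min_pixels pixels carry class_id.
    """
    tp = fp = fn = tn = 0
    for ref, pred in zip(reference_masks, predicted_masks):
        ref_pos = sum(row.count(class_id) for row in ref) >= min_pixels
        pred_pos = sum(row.count(class_id) for row in pred) >= min_pixels
        if ref_pos and pred_pos:
            tp += 1
        elif pred_pos:
            fp += 1
        elif ref_pos:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def se_sp_acc(tp, fp, fn, tn):
    """Sensitivity, specificity and accuracy from image-level counts."""
    se = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    acc = (tp + tn) / (tp + fp + fn + tn)
    return se, sp, acc


# Four toy 2 x 2 images: one TP, one TN, one FN and one FP for class 1.
refs = [[[1, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 1], [0, 0]], [[0, 0], [0, 0]]]
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 0]], [[0, 0], [0, 1]]]

counts = image_level_counts(refs, preds, class_id=1)
print(counts, se_sp_acc(*counts))
```

A higher `min_pixels` threshold would suppress isolated false-positive pixels at the cost of missing very small lesions, which may matter for classes such as enamel breakdowns.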

DISCUSSION
This study developed and evaluated an AI-based diagnostic model for the detection, classification, and localization of caries as well as MIH in professionally captured clinical photographs of teeth. Furthermore, the model was made openly accessible as a web application (http://demo.dental-ai.de). In particular, the use of precise object labeling in a large image set and pixelwise image analysis utilizing a transformer network with a segmentation head resulted in a model that can simultaneously identify different pathologies, including subscores, from dental photographs (Fig. 2).
The comparison and interpretation of the shown data for pixelwise analysis (Table 2 and Fig. 1) is limited at the moment, simply due to the lack of technically comparable projects in dentistry. However, the following discussion should give an overview of the recent state of knowledge. In general, the transformer model achieved an overall ACC value of 0.978 at the pixel level, and in the majority of the included diagnostic categories, an ACC value >0.99 was reached. In the case of non-cavitated caries, the ACC was 0.99. It can be concluded that the ACC was very high, which is in line with the available literature evaluating transformers [16][17][18], and finally, the initially formulated goal was reached. The ACC values documented in the image-related analysis (>95%, Table 3) compare with ACC values of approximately 90% in previously published CNN-based studies on caries [10][11][12] and MIH detection 13,14. This comparison indicates that the use of exact annotations and a powerful transformer network, as well as other improvements such as pixelwise analysis and the inclusion of commonly used caries and MIH categories, may surpass CNN-based algorithms in terms of diagnostic performance. Nevertheless, it should be noted that misclassification is possible and might predominantly be linked to lesions of smaller size.
In terms of the AP, the anticipated value of 0.5 was exceeded after finetuning, with individual values of up to 0.902 (Table 2 and Fig. 1). These values match those of other current studies in medicine 17,18 and dentistry 16 for radiographs. Interestingly, the AP and IoU values may be influenced by the overall pixel number and depend further on the number of annotations. In other words, all high-frequency categories with a large pixel quantity, e.g., non-cavitated caries, dentin cavity caries or MIH-related opacities (Tables 1 and 2), were found to be associated with higher AP and IoU values. In contrast, less-frequent categories with small-sized lesions, e.g., caries-related grayish translucencies and enamel breakdowns as well as MIH-related enamel disintegrations (Tables 1 and 2), were generally linked to lower AP and IoU values. This finding might be explained by the small-sized lesions and possible edge inaccuracies that potentially occur during labeling. Considering the latter aspect, it is inevitable that the manually drawn labels around any pathology will also contain pixels of sound dental hard tissue. This may confuse the model during training and affect its accuracy in general, possibly more severely in cases of less frequent and small-sized dental defects. To overcome this issue and further improve diagnostic performance, the number of images, especially those showing the previously mentioned pathologies, should be increased continuously. Consequently, future research is needed to address this issue.
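What a pixelwise output makes possible can be sketched as follows: once every pixel carries a class label, the area and location of each finding follow directly from the label map. The label IDs, class names and the function `summarize_segmentation` are hypothetical placeholders and do not reflect the published model's actual output format.

```python
def summarize_segmentation(seg_map, class_names):
    """Return pixel area and bounding box per non-background class.

    `seg_map` is a 2D list of integer labels (0 = background, an assumption).
    Boxes are [x_min, y_min, x_max, y_max]; all regions of one class are
    merged into a single box (no connected-component separation).
    """
    findings = {}
    for y, row in enumerate(seg_map):
        for x, label in enumerate(row):
            if label == 0:
                continue
            name = class_names[label]
            entry = findings.setdefault(name, {"area": 0, "box": [x, y, x, y]})
            entry["area"] += 1
            box = entry["box"]
            box[0], box[1] = min(box[0], x), min(box[1], y)
            box[2], box[3] = max(box[2], x), max(box[3], y)
    return findings


# Hypothetical label IDs for two of the diagnostic categories.
CLASS_NAMES = {1: "non-cavitated caries lesion", 2: "demarcated opacity"}

seg_map = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]

for name, info in summarize_segmentation(seg_map, CLASS_NAMES).items():
    print(name, info)
```

Separating multiple lesions of the same class on one tooth would require an additional connected-component step, which is left out here for brevity.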
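The effect of edge inaccuracies on small lesions can be made concrete with a short worked example. It assumes an idealized square lesion whose manual label is drawn a fixed number of pixels too wide on every side; the function name and the chosen lesion sizes are illustrative only.

```python
def iou_with_border_error(side, border=1):
    """IoU between a side x side lesion and a label drawn `border` pixels too wide.

    The lesion lies entirely inside the label, so the intersection equals the
    lesion area and the union equals the label area.
    """
    lesion_area = side * side
    label_area = (side + 2 * border) ** 2
    return lesion_area / label_area


for side in (10, 50, 200):
    print(side, round(iou_with_border_error(side), 3))
# 10  -> 0.694
# 50  -> 0.925
# 200 -> 0.98
```

Under this assumption, the same one-pixel labeling error costs a 10-pixel lesion roughly 30% of its attainable IoU but a 200-pixel lesion only about 2%, which is consistent with the lower values observed for small-sized categories.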
In medicine, transformer-based AI algorithms have predominantly been used for language or text recognition and processing tasks 19. Meanwhile, they have also been used for object detection [15][16][17][18]. The use of a transformer network with a segmentation head (following the SegFormer architecture) has the advantage that diagnostic decisions of the model can be made on the pixel level. Classification and localization are thus unified in one step, and a segmentation map is generated that may contain multiple diagnoses for the image at once and allows size and location estimation. Due to the available hardware resources, it was possible to process all images with an appropriate resolution, which probably contributed to the precision of the developed algorithm.
This study has several strengths and limitations. The sizeable number of dental photographs (N = 18,179) combined with the fact that all images were individually annotated pixelwise and counterchecked by trained and calibrated dentists according to widely accepted classification systems must be highlighted as fundamental features. The image augmentation procedures may have contributed substantially to the continuous increase in diagnostic performance over 250 training epochs, during which almost no overfitting was observed (Fig. 1). The inclusion of multiple image classes from ImageNet (Fig. 3) during the training process may have supported the robustness and generalizability of the model. As a result, only the desired dental findings were detected, and similar pixel patterns in other image classes were not mistakenly interpreted as dental defects (Fig. 2 and Tables 2 and 3).
When discussing the potential limitations of this study, the image dataset has to be considered first. At the present stage, it can be assumed that the diagnostic performance might be equal in populations that are similar to those in the dataset, e.g., Caucasian children, adolescents and adults. In contrast, the evaluation of teeth from other ethnic populations or regions might be less reliable due to the known differences in the clinical appearance of teeth. Therefore, it would be essential to conduct external validation studies aimed at assessing the model performance independently from the used dataset. Furthermore, not all types of dental restoration or developmental or genetically determined disorders that affect teeth have been included in the model thus far. Consequently, the dataset and model need to be extended steadily to cover the spectrum of prevalent and rare dental pathologies as well as restorations as best as possible. Another limitation seems to be that the dataset consists of only high-quality dental photographs. Considering that images captured by various intraoral cameras, semiprofessional cameras or even mobile phones can also potentially be analyzed by the developed algorithm, the importance of proper image quality needs to be highlighted. This includes not only technical properties, e.g., correctly exposed and uncompressed images with an appropriately high resolution, but also the ideal photographic representation of the object of interest. Therefore, it seems to be important, first, to safeguard high photographic image quality and, second, to include suboptimal images in future training sets. Such aspects require additional research. These technical aspects may also influence and potentially limit the automated feedback by the segmentation model when users upload their own low-quality images.
In conclusion, the present diagnostic study demonstrated excellent model performance in detecting and localizing different caries and MIH classes from professional dental photographs. The study aim was reached by using a large image set with precise object annotations, image augmentation, and a transformer network. Nevertheless, the model needs to be further improved and evaluated under clinical conditions.

Ethical approval and reporting
This study on caries detection by AI-based methods was approved by the Ethics Committee of the Medical Faculty of the Ludwig-Maximilians University of Munich (project number 020-798). This study used anonymized intraoral photographs from previously conducted investigations or from clinical situations in which images were taken for educational purposes. Consequently, no patients could be identified, and written informed consent could not be obtained. This investigation was reported following the recommendations of the Standards for Reporting of Diagnostic Accuracy Studies (STARD) steering committee 20 and recently published recommendations for designing and conducting studies using AI methods in dental research 21.

Digital dental photographs
All clinical photographs were taken using standard procedures by experienced dentists (JK, RHW) over a period of more than ten years. In brief, clinical image acquisition included the use of professional single-lens reflex cameras (Nikon D200, D300, D7100 or D7200, Nikon, Tokyo, Japan) equipped with a macro lens (Nikon AF-S Micro Nikkor 105 mm 1:2.8 G, Nikon, Tokyo, Japan) and a macro flash (EM-140DG, Sigma, Rödermark, Germany) after tooth cleaning and drying. Posterior teeth were photographed indirectly using intraoral mirrors 10,14,22,23.
All available dental photographs from occlusal and freely accessible surfaces were processed anonymously. To safeguard high image quality in the whole image set, insufficient photographs, e.g., over/underexposed, distorted or blurred images, were excluded. All included single tooth photographs were standardized according to the following parameters: aspect ratio of 1:1, resolution of 1200 × 1200 pixels with no compression, jpeg format and RGB color space. Thus, most of the included images were cropped and/or rotated using professional photo editing software (Affinity Photo, Serif, Nottingham, UK) until the tooth surface filled most of the frame. The dental image set included a broad spectrum of teeth that ranged from healthy to severely destroyed due to caries and MIH. Photographs with dental restorations, sealants, orthodontic appliances, or teeth with rare dental diseases, e.g., amelogenesis imperfecta or dentinogenesis imperfecta, were not excluded from the dataset. Finally, the image set comprised 18,179 single tooth photographs (4483 primary and 7699 permanent posterior teeth; 2339 primary and 3658 permanent anterior teeth). This sample represented the largest available number of single tooth photographs, which were further completed with high-quality annotations with the aim of increasing the model performance.

Dental pathology annotation (reference standard)
The anonymized image set was stored and processed on a university-based computer cloud to enable pixelwise labeling with the open source, web-based Computer Vision Annotation Tool (CVAT, server version 2.0, core version 4.2.1, Intel, Santa Clara, CA, USA). Initially, all images were split into five equal subsamples and were annotated by five trained and calibrated dental graduates (M.F., A.S., P.E., J.S., F.Z.). In case of questionable findings regarding detection, classification and size, these images or pathologies were re-examined and discussed with the experienced dentist (J.K., >20 years of clinical practice and scientific experience) until consensus on each diagnostic decision was reached. In another cycle, all annotations in terms of classifications and marked areas were independently checked and, if necessary, corrected by an experienced dentist (J.K.) with the aim of ruling out potential errors or misclassifications. The detection and classification of caries and MIH was made in agreement with widely accepted diagnostic scoring systems [24][25][26][27][28][29]. In detail, when a caries lesion was visually detectable in a clinical image, its location was annotated and classified according to the following scores: (1) non-cavitated caries lesion (first sign and established lesion), (2) grayish translucency, (3) localized enamel breakdown, (4) caries-related cavitation (dentin exposure and large cavity) and (5) largely/severely destroyed tooth with almost complete loss of the crown [24][25][26][27][28]. The following criteria were applied for chalky teeth detection: (1) demarcated opacity (hypomineralization/chalky tooth area with intact tooth surface), (2) enamel disintegration (hypomineralized hard tissue with enamel breakdown or dentin exposure) and (3) MIH-related restoration 29. Each single tooth photograph could have multiple diagnostic findings (Table 1), which were annotated separately from each other. All dental annotations served as reference standards and were later used for cyclic training and evaluation of the transformer-based model.
Prior to the study, over the course of a 2-day workshop, all participating dentists were explicitly instructed in the field of dental diagnostics by the principal investigator (J.K.). The scoring reliability of all annotators regarding the detection and classification of caries and MIH was determined by diagnosing 140 single tooth photographs. The corresponding Kappa values for the intra- and inter-examiner reproducibility of the dental annotators (M.F., A.S., P.E., J.S., F.Z.) were found to be good to excellent for caries (intra: 0.858-1.000; inter: 0.656-0.837) and MIH (intra: 0.836-1.000; inter: 0.693-0.886). Mutual exchange of knowledge between all annotators and the principal investigator was possible at any time during the study project. Furthermore, the dental work group had frequent and regular meetings to enable constant and proper decision making.
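For readers unfamiliar with the reproducibility statistic, a minimal sketch of an agreement calculation follows. It assumes unweighted Cohen's kappa between two sets of scores; the source does not state which kappa variant was used, and the example ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equally long lists of category scores."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: share of photographs scored identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal frequencies per category.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Invented caries scores (0 = sound, 1 = non-cavitated, 4 = dentin cavity)
# assigned to ten photographs by two annotators.
annotator_1 = [0, 1, 1, 4, 0, 1, 0, 4, 1, 0]
annotator_2 = [0, 1, 0, 4, 0, 1, 0, 4, 1, 1]

print(cohens_kappa(annotator_1, annotator_2))  # 0.6875
```

The same function can serve for intra-examiner reproducibility by passing the first and second scoring round of a single annotator.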

Vision transformer-based model development (test method)
The AI-based algorithm for the detection, classification and localization of caries and MIH was trained using a pipeline of methods, mainly including image augmentation and the adaptation of a transformer network. Before training, the entire image set of single tooth photographs (N = 18,179 images) was randomly divided into a training set (N = 16,679), validation set (N = 500) and test set (N = 1000). With respect to the large image set, a test sample size of 1000 photographs with 1993 annotations (Table 1) was considered appropriate to enable extensive model training and rigorous evaluation. The test set was not made available to the machine learning model as training material; it only served as an independent test set. The detailed composition of the image set in relation to registered pathologies is shown in Table 1.
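The random division described above can be reproduced in outline as follows. This is a sketch under stated assumptions: a simple seeded shuffle without stratification by class, with integer image IDs standing in for the photographs. The function name, the seed and the absence of stratification are not taken from the study.

```python
import random


def split_image_ids(image_ids, n_val=500, n_test=1000, seed=0):
    """Randomly split image IDs into disjoint training, validation and test sets."""
    ids = list(image_ids)
    if n_val + n_test > len(ids):
        raise ValueError("validation and test sizes exceed the number of images")
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


train, val, test = split_image_ids(range(18179))
print(len(train), len(val), len(test))  # 16679 500 1000
```

Because the three lists are slices of one shuffled sequence, no image can appear in more than one set, which is the property that keeps the test set independent of training.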
Machine learning models require a large and variable number of training images to achieve excellent diagnostic performance. In

Fig. 1 Average precision (AP) in relation to the training progress for the caries and MIH categories. All lines in graphs are plotted over 250 epochs.

Fig. 2 Examples of clinical images and the corresponding outputs by the segmentation model. The description and the corresponding false-colored segments indicate the diagnostic category.

Fig. 3 Examples of augmented images that were continuously generated during the training process. More than four million augmented images were used to train the vision transformer-based model over 250 epochs.

Table 1. Overview of the included pixelwise annotations for the

Table 2. Diagnostic performance of the transformer-based model on a pixel level after 250 training epochs and additional finetuning.